A Novel BiLevel Paradigm for Image-to-Image Translation
Image-to-image (I2I) translation is a pixel-level mapping that requires a
large amount of paired training data and often suffers from the problems of
high diversity and strong category bias in image scenes. In order to tackle
these problems, we propose a novel BiLevel (BiL) learning paradigm that
alternates the learning of two models, respectively at an instance-specific
(IS) and a general-purpose (GP) level. In each scene, the IS model learns to
maintain the specific scene attributes. It is initialized by the GP model that
learns from all the scenes to obtain the generalizable translation knowledge.
This GP initialization gives the IS model an efficient starting point, thus
enabling its fast adaptation to the new scene with scarce training data. We
conduct extensive I2I translation experiments on human face and street view
datasets. Quantitative results validate that our approach can significantly
boost the performance of classical I2I translation models, such as PG2 and
Pix2Pix. Our visualization results show both higher image quality and more
appropriate instance-specific details, e.g., the translated image of a person
looks more like that person in terms of identity.
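To make the alternation concrete, here is a minimal PyTorch sketch of a
bilevel loop of this kind. It assumes a Reptile-style first-order meta update
for the GP model and a hypothetical scene.few_paired_samples loader; it
illustrates the paradigm, not the authors' released code.

```python
# Sketch of a BiLevel (BiL) alternation: a general-purpose (GP) model learned
# across scenes initializes an instance-specific (IS) model per scene.
# Assumptions: Reptile-style first-order update; `scene.few_paired_samples`
# is a hypothetical loader of scarce paired data for one scene.
import copy
import torch
import torch.nn.functional as F

def train_bilevel(gp_model, scenes, outer_steps=1000, inner_steps=5, lr=1e-4):
    gp_opt = torch.optim.Adam(gp_model.parameters(), lr=lr)
    for _ in range(outer_steps):
        for scene in scenes:
            # IS model starts from the GP weights: the "efficient starting point"
            is_model = copy.deepcopy(gp_model)
            is_opt = torch.optim.Adam(is_model.parameters(), lr=lr)
            for x, y in scene.few_paired_samples(inner_steps):
                loss = F.l1_loss(is_model(x), y)   # pixel-level translation loss
                is_opt.zero_grad(); loss.backward(); is_opt.step()
            # GP model absorbs generalizable knowledge from the adapted IS model
            # (first-order meta update, Reptile-style: an assumption here)
            gp_opt.zero_grad()
            for p_gp, p_is in zip(gp_model.parameters(), is_model.parameters()):
                p_gp.grad = p_gp.data - p_is.data
            gp_opt.step()
    return gp_model
```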
Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation
We propose a novel approach to jointly perform 3D shape retrieval and pose
estimation from monocular images. In order to make the method robust to
real-world image variations, e.g. complex textures and backgrounds, we learn an
embedding space from 3D data that only includes the relevant information,
namely the shape and pose. Our approach explicitly disentangles a shape vector
and a pose vector, which alleviates both pose bias for 3D shape retrieval and
categorical bias for pose estimation. We then train a CNN to map images into
this embedding space, retrieve the closest 3D shape from the database, and
estimate the 6D pose of the object. Our method achieves a median error of
10.3 for pose estimation and a top-1 accuracy of 0.592 for category-agnostic
3D object retrieval on the Pascal3D+ dataset, outperforming the previous
state-of-the-art methods on both tasks.
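The inference path implied by the abstract can be sketched as follows. The
two-headed cnn, the precomputed database embeddings, and cosine similarity as
the retrieval metric are all assumptions for illustration, not the paper's
exact implementation.

```python
# Sketch: a CNN maps an image to disentangled shape/pose vectors; the shape
# vector is matched against a database of 3D-shape embeddings, and the 6D
# pose is read from the pose vector. All names are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_and_estimate(cnn, image, shape_db_embeddings, shape_db_ids):
    """image: (3, H, W); shape_db_embeddings: (K, D); shape_db_ids: list of K ids."""
    shape_vec, pose_vec = cnn(image.unsqueeze(0))         # disentangled outputs
    # Nearest 3D shape by cosine similarity in the shape-only embedding space,
    # which is what removes the pose bias from retrieval
    sims = F.cosine_similarity(shape_vec, shape_db_embeddings)
    best = sims.argmax().item()
    return shape_db_ids[best], pose_vec.squeeze(0)        # retrieved shape, 6D pose
```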
Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation
Domain Adaptation (DA) is always challenged by the spurious correlation
between domain-invariant features (e.g., class identity) and domain-specific
features (e.g., environment) that does not generalize to the target domain.
Unfortunately, even when enriched with additional unlabeled target-domain
data, existing Unsupervised DA (UDA) methods still suffer from it. This is because
the source domain supervision only considers the target domain samples as
auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the
target domain -- where the valuable de-correlation clues hide -- is
disregarded. We propose to make the U in UDA matter by giving equal status to
the two domains. Specifically, we learn an invariant classifier whose
prediction is simultaneously consistent with the labels in the source domain
and clusters in the target domain; hence the spurious correlation, which is
inconsistent in the target domain, is removed. We dub our approach "Invariant
CONsistency learning" (ICON). Extensive experiments show that ICON achieves
state-of-the-art performance on the classic UDA benchmarks Office-Home and
VisDA-2017, and outperforms all conventional methods on the challenging
WILDS 2.0 benchmark. Code is available at https://github.com/yue-zhongqi/ICON.
Comment: Accepted by NeurIPS 2023
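A rough sketch of a loss with this "equal status" flavor follows. It assumes
target cluster assignments have already been computed and aligned to class
indices (e.g., by matching cluster centroids to classifier prototypes); the
clustering step and the weighting lam are simplifications, not the paper's
exact recipe.

```python
# Sketch: one classifier trained to agree with ground-truth labels on the
# source domain AND with cluster assignments on the target domain, giving
# the two domains equal status. `tgt_cluster_ids` are assumed pre-aligned
# to class indices; this is an illustrative assumption.
import torch
import torch.nn.functional as F

def icon_style_loss(classifier, x_src, y_src, x_tgt, tgt_cluster_ids, lam=1.0):
    src_loss = F.cross_entropy(classifier(x_src), y_src)            # source labels
    tgt_loss = F.cross_entropy(classifier(x_tgt), tgt_cluster_ids)  # target clusters
    # Predictions consistent with both terms cannot rely on correlations
    # that break in the target domain
    return src_loss + lam * tgt_loss
```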
Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation
Extracting class activation maps (CAM) is a key step for weakly-supervised
semantic segmentation (WSSS). The CAM of convolutional neural networks fails
to capture long-range feature dependencies in the image, resulting in
coverage of only parts of the foreground objects, i.e., many false negatives.
An intuitive solution is "coupling" the CAM with the long-range attention
matrix of vision transformers (ViT). We find that the direct "coupling",
e.g., pixel-wise multiplication of attention and activation, achieves more
global coverage (on the foreground) but unfortunately comes with a large
increase in false positives, i.e., background pixels are mistakenly included.
This paper tackles this issue with a new method that couples the CAM and the
attention matrix in a probabilistic diffusion way, dubbed AD-CAM.
Intuitively, it integrates ViT attention and CAM activation in a conservative
and convincing way. Conservative means refining the attention between a pair
of pixels based on their respective attentions to common neighbors, where the
intuition is that two pixels with very different neighborhoods are rarely
dependent, i.e., their attention should be reduced. Convincing means
diffusing a pixel's activation to its neighbors (on the CAM) in proportion to
the corresponding attentions (on the AM). In experiments, our results on the
two challenging WSSS benchmarks PASCAL VOC and MS COCO show that AD-CAM as
pseudo labels can yield stronger WSSS models than state-of-the-art variants
of CAM.
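One plausible reading of the conservative/convincing coupling, as a toy
PyTorch function; the second-order agreement term and the row normalization
are assumptions about the mechanism, not the released AD-CAM implementation.

```python
# Toy sketch of attention-based class activation diffusion:
# "conservative" = down-weight attention between pixel pairs whose attention
# distributions over common neighbors disagree (second-order agreement);
# "convincing"   = diffuse each pixel's activation to its neighbors in
# proportion to the refined attention. A plausible reading of the abstract.
import torch

def ad_cam_style_coupling(attn, cam):
    """attn: (N, N) ViT pixel-pair attention; cam: (N, C) class activations."""
    agreement = attn @ attn.t()                  # shared-neighbor agreement
    refined = attn * agreement                   # conservative refinement
    refined = refined / refined.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return refined @ cam                         # convincing diffusion of CAM
```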
Class-Incremental Exemplar Compression for Class-Incremental Learning
Exemplar-based class-incremental learning (CIL) finetunes the model with all
samples of new classes but few-shot exemplars of old classes in each
incremental phase, where the "few-shot" abides by the limited memory budget. In
this paper, we break this "few-shot" limit based on a simple yet surprisingly
effective idea: compressing exemplars by downsampling non-discriminative pixels
and saving "many-shot" compressed exemplars in the memory. Without needing any
manual annotation, we achieve this compression by generating 0-1 masks on
discriminative pixels from class activation maps (CAM). We propose an adaptive
mask generation model called class-incremental masking (CIM) to explicitly
resolve two difficulties of using CAM: 1) transforming the heatmaps of CAM to
0-1 masks with an arbitrary threshold leads to a trade-off between the coverage
on discriminative pixels and the quantity of exemplars, as the total memory is
fixed; and 2) optimal thresholds vary for different object classes, which is
particularly obvious in the dynamic environment of CIL. We optimize the CIM
model alternately with the conventional CIL model through a bilevel
optimization problem. We conduct extensive experiments on high-resolution CIL
benchmarks including Food-101, ImageNet-100, and ImageNet-1000, and show that
using the compressed exemplars by CIM can achieve a new state-of-the-art CIL
accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-Phase
ImageNet-1000. Our code is available at https://github.com/xfflzl/CIM-CIL.
Comment: Accepted to CVPR 2023
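The compression idea can be sketched as below, with a fixed threshold
standing in for the learned class-incremental masking (CIM) model. In the
actual method only the downsampled copy and the masked full-resolution pixels
would be stored, which is what saves memory; this sketch returns the
reconstructed exemplar for clarity.

```python
# Sketch: threshold a CAM heatmap into a 0-1 mask, keep discriminative pixels
# at full resolution, and represent the rest with a downsampled copy. The
# fixed `threshold` is a stand-in for the learned CIM model.
import torch.nn.functional as F

def compress_exemplar(image, cam, threshold=0.5, down_factor=4):
    """image: (3, H, W) tensor; cam: (H, W) heatmap normalized to [0, 1]."""
    mask = (cam >= threshold).float()            # 0-1 mask on discriminative pixels
    lowres = F.interpolate(image.unsqueeze(0), scale_factor=1 / down_factor,
                           mode="bilinear", align_corners=False)
    background = F.interpolate(lowres, size=image.shape[-2:],
                               mode="bilinear", align_corners=False).squeeze(0)
    # Full resolution where the mask fires, cheap downsampled pixels elsewhere
    return mask * image + (1 - mask) * background
```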
Visual Commonsense R-CNN
We present a novel unsupervised feature representation learning method,
Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to
serve as an improved visual region encoder for high-level tasks such as
captioning and VQA. Given a set of detected object regions in an image (e.g.,
using Faster R-CNN), like any other unsupervised feature learning method
(e.g., word2vec), the proxy training objective of VC R-CNN is to predict the
contextual objects of a region. However, they are fundamentally different: the
prediction of VC R-CNN is by using causal intervention: P(Y|do(X)), while
others are by using the conventional likelihood: P(Y|X). This is also the
core reason why VC R-CNN can learn "sense-making" knowledge, such as a chair
can be sat on, rather than just "common" co-occurrences, such as a chair is
likely to exist if a table is observed. We extensively apply VC R-CNN
features in prevailing models
of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent
performance boosts across them, achieving many new state-of-the-art results.
Code and features are available at https://github.com/Wangt-CN/VC-R-CNN.
Comment: Accepted by CVPR 2020
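The distinction between P(Y|X) and P(Y|do(X)) can be illustrated with a
backdoor-adjusted prediction, P(Y|do(X)) = sum_z P(Y|X, z) P(z), averaging
over a dictionary of confounder features z. The two-argument head and the
confounder dictionary here are illustrative assumptions, not VC R-CNN's
exact implementation.

```python
# Didactic sketch of backdoor adjustment: instead of conditioning on the
# observed context (likelihood P(Y|X)), average predictions over a fixed
# confounder dictionary weighted by its prior, approximating P(Y|do(X)).
# `head(x, z)` is a hypothetical module returning class logits.
import torch

def backdoor_prediction(head, x_feat, confounder_dict, prior):
    """x_feat: (B, D); confounder_dict: (K, D); prior: (K,) summing to 1."""
    probs = 0.0
    for z, p_z in zip(confounder_dict, prior):
        z_batch = z.unsqueeze(0).expand(x_feat.size(0), -1)
        probs = probs + p_z * torch.softmax(head(x_feat, z_batch), dim=-1)
    return probs   # sum_z P(Y | X, z) P(z), i.e., the intervened prediction
```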